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Predicting ambulance demand accurately in line resolutions in 
space and time is critical for ambulance fleet management and dy¬ 
namic deployment. Typical challenges include data sparsity at high 
resolutions and the need to respect complex urban spatial domains. 

To provide spatial density predictions for ambulance demand in Mel¬ 
bourne, Australia as it varies over hourly intervals, we propose a pre¬ 
dictive spatio-temporal kernel warping method. To predict for each 
hour, we build a kernel density estimator on a sparse set of the most 
similar data from relevant past time periods (labeled data), but warp 
these kernels to a larger set of past data irregardless of time periods 
(point cloud). The point cloud represents the spatial structure and 
geographical characteristics of Melbourne, including complex bound¬ 
aries, road networks, and neighborhoods. Borrowing from manifold 
learning, kernel warping is performed through a graph Laplacian of 
the point cloud and can be interpreted as a regularization towards, 
and a prior imposed, for spatial features. Kernel bandwidth and de¬ 
gree of warping are efficiently estimated via cross-validation, and can 
be made time- and/or location-specific. Our proposed model gives sig¬ 
nificantly more accurate predictions compared to a current industry 
practice, an unwarped kernel density estimation, and a time-varying 
Gaussian mixture model. 


1. Introduction. A primary goal of emergency medical services (EMS) 
is to minimize response times to life-threatening emergencies while keeping 
operational costs low. Accurate spatial-temporal ambulance demand predic¬ 
tions are crucial to optimal operations management of base location, staff, 
fleet, and deployment. These demand predictions are ideally needed at high 
temporal and spatial granularities. The industry typically predicts for every 
hour and every 1 km^ region. We are motivated to predict this demand for 
the city of Melbourne, Australia. 

There are several typical challenges to predicting ambulance demand. 

• Ambulance demand is often exceedingly sparse at the temporal and 
spatial resolution required for prediction. There is zero demand in the 
vast majority of 1-km^ regions over a 1-hour period. 

Keywords and phrases: emergency medical service, kernel density estimation, manifold 
learning, graph Laplacian 
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• This demand arises from complex urban geography. The city boundary 
is often highly irregular. Ambulance demand can be very high (coastal 
and downtown) or very low (suburbs) along the boundary. Within this 
boundary, demand follows closely the city’s infrastructure and terrain; 
there might be high demand along central highways and zero demand 
within an internal lake. High resolutions covariates of these features 
are not readily available. 

• Ambulance demand exhibits spatial and temporal patterns. Weekly 
seasonality is usually prominent (Channouf et ah, 2007; Matteson 
et ah, 2011); the industry relies heavily on this seasonality to make pre¬ 
dictions. Some studies have also noted daily seasonality and short-term 
serial dependence at densely-populated regions (Zhou et ah, 2015). 

• Ambulance demand data for large cities is often large-scale. This presents 
computational challenges, especially since predictions are needed very 
frequently. 

It is particularly difficult to simultaneously resolve these challenges. Over¬ 
coming sparsity requires considerable smoothing, while capturing complex 
spatio-temporal patterns requires fine-resolution modeling. At high granu¬ 
larities, data sparsity makes it difficult to detect spatio-temporal character¬ 
istics accurately. At low granularities, differences across regions and times 
are not sufficiently captured for optimal ambulance planning. 

Figure 1 demonstrates these challenges in predicting ambulance demand 
for Melbourne. We show on the right the locations of 696, 975 demand inci¬ 
dents from years 2011 and 2012 (in gray), and those of 38 demand incidents 
for a typical 1-hour period (in black). On average, 99.6% of the 1-km^ re¬ 
gions in Melbourne receive zeros calls in any given hour. Comparing to a 
map of Melbourne on the left (Google Maps, 2015), we observe a highly com¬ 
plex spatial boundary as Melbourne encloses a large bay to its south-west. 
Demand is high near the bay, but low on the outskirt suburbs. Demand is 
visibly higher at small satellite suburban neighborhoods and along major 
highways radiating out from the city center. There is lack of demand due to 
several reservoirs and a national park to the west and northwest. Consistent 
with typical patterns, the demand exhibits strong weekly seasonality. 

The EMS industry and previous studies have attempted to address some 
of these challenges. The current industry practice uses a simple averaging 
formula. Demand in a 1-km^ spatial region over an hour is typically predicted 
by averaging a small number of historical counts, from the same spatial 
region, over the corresponding hours from previous weeks or years (Goldberg, 
2004). For instance, the EMS of Charlotte-Mecklenburg, North Carolina uses 
a method called MEDIC, in which the prediction is the average of twenty 
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Fig 1: Left: map of Melbourne (Google Maps, 2015); right: spatial locations 
of all 696,975 Melbourne ambulance demand incidents from years 2011 - 
2012 (in gray), and 38 demand incidents for a typical 1-hour period (in 
black). We observe complex boundary and geographical features (e.g., high¬ 
ways, roads, satellite suburbs). 


corresponding counts in the same hour of the preceding four weeks for the 
past five years (Setzler, Saydam and Park, 2009). Averaging so few historical 
counts, which are mostly zeros, produces noisy and flickering predictions, 
resulting in haphazard and inefficient deployment. 

Much attention has been given to predicting the aggregate ambulance 
demand as a temporal process, using autoregressive moving average models 
(Channouf et ah, 2007), factor models (Matteson et ah, 2011), and spectral 
analysis (Vile et ah, 2012). Few studies have modeled spatio-temporal am¬ 
bulance demand well. Setzler, Saydam and Park (2009) use artificial neural 
networks, but fail to improve over the industry method. A recent study by 
Zhou et ah (2015) predicts ambulance demand for Toronto, Canada using a 
time-varying Gaussian mixture model (GMM). This method is more accu¬ 
rate than the industry practice, but, as the authors point out, extending it 
to incorporate spatial boundaries would be prohibitively expensive. While 
it may not be essential for Toronto since the city is almost rectangular 
in shape, it becomes important for Melbourne; it is difficult for ellipsoidal 
Gaussian components to model demand well on the highly complex spa¬ 
tial domain of Melbourne. Another study by Zhou and Matteson (2015) 
considers a spatio-temporal weighted kernel density estimation (KDE) to 
predict Toronto’s ambulance demand. It gives similarly accurate predictions 
as GMM (both much better than industry), showing promise for KDE. 

KDE has been used to analyze spatio-temporal data such as crime inci¬ 
dence (Nakaya and Yano, 2010), disease spread (Zhang et ah, 2011), and data 
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streams (Aggarwal, 2003). It allows for rapid identification of “hotspots” and 
their evolutions in time and space. However, implementing a naive KDE is 
not satisfactory for our application. The chosen bandwidth necessarily has 
to be quite large given the data sparsity, smoothing inappropriately across 
boundary features and disregarding the underlying urban geography. 

Few studies have focused on modeling spatial or spatio-temporal point 
processes on complex spatial structures. Most studies assume a boundary 
defined a priori (polygon or pixelated). If not, ad hoc methods based on the 
convex hull of all observed points are typically used (Ripley and Rasson, 
1977). This invariably results in a convex boundary that may be inaccurate 
where data is sparse. Even with a boundary optimally defined, few methods 
are equipped to handle complex boundary features. Ramsay (2002) proposes 
a finite window smoother with known boundary conditions computed using 
an expensive finite element approach. Building on that. Wood, Bravington 
and Hedley (2008) model the boundary condition as a loop of wire and the 
point process as a soap film suspended from the boundary wire. They repre¬ 
sent this smoother as a penalized basis, compute it via multigrid, and select 
smoothness via generalized cross-validation. They acknowledge the lack of 
an elegant solution when the boundary conditions are unknown. Apart from 
boundary, other spatial characteristics, such as neighborhood structures and 
road networks, are rarely incorporated in modeling. We propose a method 
that can efficiently capture and exploit a wide range of spatial characteris¬ 
tics. We draw from theory and methods developed in manifold learning. 

Manifold learning, a branch of machine learning, is concerned with esti¬ 
mating and exploiting the underlying structures of data. The assumption is 
that data in a high-dimensional space resides on or near a lower-dimensional 
sub-manifold. In practice, we do not have access to this sub-manifold, but 
we can approximate it from a point cloud, i.e., a mass of historical data. 
The most common method is to construct an adjacency graph of this point 
cloud and make use of the properties and structures of this graph. This idea 
has led to many popular learning methods, including isomap (Tenebaum, 
de Silva and Langford, 2000), local linear embedding (Roweis and Saul, 
2000), and Laplacian eigenmaps (Belkin and Niyogi, 2003). These meth¬ 
ods were initially designed for data representation or visualization, but have 
been adapted for semi-supervised classification (Belkin and Niyogi, 2004), 
and clustering (Ng, Jordan and Weiss, 2001; Shi and Malik, 2000). 

In particular, a variant of Laplacian eigenmaps, kernel warping, has been 
proposed for semi-supervised classification (Smola and Kondor, 2003; Belkin 
and Niyogi, 2004; Sindhwani, Niyogi and Belkin, 2005). Using a small num¬ 
ber of labeled data and a larger number of labeled and unlabeled point 
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cloud data, the method classifies new examples by constructing kernels on 
the labeled data that warp to the geometry of the point cloud. This geom¬ 
etry is represented by the adjacency graph of the point cloud. Smoothing 
orthogonal to this geometry is penalized heavily, whereas smoothing along 
this geometry is not. This method is designed for high-dimensional classifi¬ 
cation, and has good performance on text and image data. 

Drawing from this idea, we propose a novel method for modeling spatio- 
temporal point processes against complex spatial structures and features. 
To predict ambulance demand for a future time period, we have a sparse set 
of historical data that is very relevant for this prediction (labeled data). We 
fit a KDE on them, but warp the kernels to a larger set of historical data 
regardless of their relevance to this predictive task (point cloud). This point 
cloud describes our belief about the spatial structure on which the labeled 
data lies. It captures exterior and interior boundaries without needing to 
explicitly define boundaries and boundary conditions. It also incorporates a 
wide range of complex spatial similarities and discontinuities, such as roads, 
city blocks, and neighborhoods of varying shapes and densities. Intuitively, 
this warping can be thought of as a regularization that penalizes radical 
departure from and encourages flow of information along our intuition of 
the geography. In a Bayesian sense, it can also be thought of as imposing a 
prior based on how similar or different the point process is across different 
locations. Such a regularization or prior is especially beneficial when the 
labeled data is sparse. We select the kernel bandwidth and the degree of 
warping efficiently via cross-validation. Both of these parameters can be 
made time- and/or location-specific. 

We implement this method on ambulance demand data from Melbourne 
in years 2011 and 2012. Altogether there are 696,975 realized events. Each 
event contains the time and location that the ambulance was dispatched 
to. The proposed kernel warping model gives significantly more accurate 
predictions than previous approaches, including the MEDIC method as an 
industry practice, unwarped KDE, and GMM. 

We develop the kernel warping model in Section 2. We construct an un¬ 
warped KDE in 2.1, warp the kernels to the point cloud in 2.2, and allow 
for time- and location-specific warping in 2.3 for the Melbourne data. Some 
details on computation are included in 2.4. We show the empirical results 
for predicting Melbourne ambulance demand in Section 3, and conclude in 
Section 4. 

2. Model. We model Melbourne’s ambulance demand on a continuous 
spatial domain 5 C IR,^ and a discretized temporal domain of one-hour 


6 


ZHOU AND MATTESON 


intervals T = {1,2,...}. Let st^i be the location of the i-th ambulance 
demand arising from the t-th time period, for i G {l,...,nt}, where nt 
is the total number of ambulances demanded in the t-th period. Since a 
non-homogeneous Poisson process (NHPP) is a natural model for spatial 
point process (Diggle, 2003; Mpller and Waagepetersen, 2004), we assume 
{st^i : i = 1,... nt} for each time period t independently follow an NHPP over 
S, with positive intensity function A*. We decompose the intensity function 
as At(s) = 5t/t(s), for s £ S. Here, = Js ds is the aggregate demand 
intensity over the spatial domain, and /*(•) is the continuous spatial density 
of the demand at time t such that /i(s) > 0 and ft{s)ds = 1. Therefore, 

for each t, nt\Xt ~ Poisson(<5t) and St^i\Xt,nt /*(•) for iejl,... ,nt}. The 
usual practice is to model {(5t} and {ft} separately. As mentioned before, 
numerous studies have proposed sophisticated and accurate methods for 
estimating We thus focus on predicting the spatio-temporal demand 
density {ft}, which is more challenging and less studied. 

2.1. Spatio-temporal KDE. Suppose we want to predict Melbourne’s am¬ 
bulance demand for a future 1-hour period u. Given the prominent weekly 
seasonality, the most relevant observations are from the corresponding hour 
of the week for the past M weeks. They constitute the labeled data for this 
predictive task. This approach is aligned with the industry practice, and is 
shown to work well in Zhou et al. (2015). We choose the sliding window 
width M a priori. With a larger M, we have more training data, but each 
training is slower and less adaptive to recent changes in demand patterns 
(e.g., summer vs. winter). The industry and recent studies have considered 
M between 4 and 8. For Melbourne, we set M = 8, resulting in an average 
labeled data size of about 300 points (ranging from 100 to 450 for different 
periods). Let Tu = {u — 168m : m £ {1,...,M}} denote the set of labeled 
time periods, in which 168 is the number of 1-hour periods in a week. 

Starting with a simple KDE on the labeled data, we predict for any x G 5, 

1 nt 

(1) fu{^) = ^ -— ^ ^ A;(x, I LJ). 

t£Tu i=l 

Here, k is the chosen bivariate spatial kernel with bandwidth matrix H.We 
use the Gaussian kernel, and choose bandwidth H via the plug-in method 
(Wand and Jones, 1994) or smoothed cross-validation (Duong and Hazelton, 
2005). When data show large variations in density, using one fixed bandwidth 
may not be optimal (Cacoullos, 1966; Scott, 1992). A bandwidth too large 
wipes out local features where we have sufficient data; a bandwidth too small 
leads to spurious peaks where data is sparse. In the case of Melbourne, data 
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density varies substantially in space (downtown vs. neighborhoods) and time 
(midnight vs. rush hours); we may be motivated to consider a spatial- and/or 
time-varying H. 

2.2. Kernel warping. We would like to warp each kernel k in Equation 
(1) to a larger set of point cloud data that describes the spatial boundary 
and characteristics of Melbourne. We choose the point cloud data, construct 
an adjacency graph on the point cloud, define the graph Laplacian matrix, 
and warp the kernel to this Laplacian matrix. We discuss in detail each step. 
Step 1 [Choosing the point cloud]: Typically in Laplacian eigenmap 
and kernel warping applications, all labeled and unlabeled data is used as 
the point cloud. In the context of spatial statistics and our application, there 
are several points of consideration: 

(a) Which points? We consider all observations in the near past, regardless 
of the time period. If we use the same sliding window width of M = 8 
previous weeks, we are choosing from about 50,000 points. 

(b) How many points? There is a trade-off: using more points in the cloud 
leads to better approximation of the geography but slower computa¬ 
tion. Since we are in a low-dimensional space of IR,^, we may not need 
a very large number of points to depict the most salient boundary 
and spatial structures. In our application, we find 1000 spatial points 
represent Melbourne’s geography reasonably well. 

(c) Points or mesh? Alternative to using past observations, we can also 
use past data to define a pixelated spatial domain of Melbourne and 
use the included pixels as the point cloud. Doing so we lose some reso¬ 
lution and information on data density, but may gain computationally 
if it can reduce the number of point cloud data significantly. A reg¬ 
ularly spaced point cloud also induces a sparse, band-diagonal graph 
Laplacian matrix (to be discussed later), leading to further savings. 

(d) Global or local? We can have one global point cloud for the entire 
spatial domain. We can also discretize the spatial domain into several 
regions with separate local point clouds. Local point clouds can provide 
computational advantages if they are smaller. They may also offer 
accuracy advantages if they depict finer-grain characteristics or allow 
for customized degree of warping at each locale. We discuss this further 
in Section 2.3. 

In our application, we randomly sample 1000 historical observations as 
the point cloud for each “component” (to be explained in Section 2.3). We 
denote the set of point cloud data as {zj} for i G {1,..., Z}. See Figure 2 
(a) for an example cloud of 1000 points over the entire city of Melbourne. 
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For our application, we find that predictive accuracy is not sensitive to the 
random sampling of the point cloud data. If it were, a larger point cloud 
might be needed, or predictions might be repeated and averaged over several 
point cloud samples. 

Step 2 [Constructing the adjacency graph] : We construct a graph with 
nodes at each point in the point cloud and edges connecting points that 
are close. We represent this graph using a symmetric, positive semidefinite 
adjacency matrix A. 

(a) Which nodes to connect? Knowledge about the spatial domain (e.g., 
inside a building vs outside) or regularity of the point cloud (e.g., 
regular mesh) may inform a natural way to dehne how nodes should 
be connected. Without such knowledge, we can connect nodes Zi and 
Zj if Zi is among the n nearest neighbors of Zj or Zj is among the 
n nearest neighbors of Zi (symmetric relation). This requires us to 
choose n. In our experience, n should be big enough to ensure that the 
point cloud is sufficiently connected instead of being very fragmented, 
but small enough to emphasize local relationships. A second way is 
to connect nodes if the (Euclidean) distance between them is smaller 
than a threshold. This requires us to choose the threshold. 

(b) Weighted edges? In the simplest case, we can set Aij = 1 if nodes 

and Zj are connected and 0 otherwise. Another idea suggested in 
Belkin and Niyogi (2003) is to dehne weighted edges depending on 
the distance between points, i.e., Aij = exp{ —||2:j — Zj\\‘^/r} if Zi and 
Zj are connected and 0 otherwise. The authors note that they do not 
have a principled way of choosing r; we hnd it reasonable to choose 
r empirically by htting an exponential distribution on all distances 
between connected nodes. They also note that in practice a binary 
adjacency graph works well, and we agree. 

In the Melbourne application, we use n = 5 nearest neighbors and binary 
weights to construct A. Figure 2 (a) shows the adjacency graph of a sample 
point cloud of size 1000. Again, we hnd our predictive accuracy to be insen¬ 
sitive towards any reasonable variations in these choices. 

Step 3 [Constructing the Laplacian matrix]: The graph Laplacian 
matrix L is dehned to be L = D — A, in which D is the diagonal degree 
matrix, with its diagonal entries being the column (or equivalently, row) 
sum of A, i.e.. Da = J2j ^ij- T is a symmetric, positive semidehnite matrix. 
If the graph has multiple connected components, L can be rearranged into a 
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block diagonal matrix, where each block is the respective Laplacian matrix 
for each connected component. 

Here is the intuition of the Laplacian matrix. The (discrete) point cloud 
adjacency graph is an empirical approximation to our target (continuous) 
manifold of Melbourne geography. The (discrete) graph Laplacian matrix L 
is then an approximation to the (continuous) Laplace-Beltrami operator on 
this manifold. The Laplace-Beltrami operator is a manifold generalization 
of the Laplace operator, which is a linear second order differential operator 
on functions (in our case, kernels). This L induces a semi-norm on kernels 
which penalizes changes between adjacent nodes. There is a close analogy 
to heat flow; the heat (partial differential) equation has a Laplace operator 
in space. Intuitively, L guides how information (heat) spreads on the spatial 
structure (manifold approximated by graph) from any initial KDE (initial 
heat distribution). 

Step 4 [Warping the kernels]: We warp each kernel k from Equation (1) 
to the point cloud to generate a new warped kernel k. Eor any x G 5 and 
any s in the set of labeled data 

(2) fc(x,s|/L) = A;(x,s|Li') -fcJ(/ + ALK)-ULfcs, 

in which fcx = [^(x, zi | H ),..., k(x, zz \ H)] and kg = [k{s, zi | H ),..., 
k{s, Zz I H)] are vectors of kernels evaluated at x or s and the point cloud 
data {zi}. Matrix K = [k{zi,Zj \ H)]. is a symmetric matrix of 

kernels evaluated at all pairs of point cloud data, and I is a Z hy Z identity 
matrix. The parameter A > 0 represents the degree of deformation. When 
A = 0, we have k = k. When A —)■ oo, /c approaches a positive constant on 
the point cloud (steady state heat distribution). 

Equation (2) is obtained by warping the Reproducing Kernel Hilbert 
Space (RKHS) associated with the chosen kernel. We modify the RKHS 
with a point-cloud semi-norm XL. This deforms the kernel k along a finite¬ 
dimensional subspace given by the point cloud data. The modified RKHS is 
shown to be another RKHS, i.e., k is a properly defined kernel. See Sind- 
hwani, Niyogi and Belkin (2005) and Belkin, Niyogi and Sindhwani (2006) 
for more details (they use the point cloud semi-norm of AL^; we consider 
the simplified case where p = 1). 

There are three interpretations of this type of kernel warping. The first 
is that of heat flow as mentioned before. We allow information (heat) to 
spread along the graph of the point cloud (approximately the manifold of 
Melbourne’s geography). The second interpretation is a graph regularizer. 
Variations between adjacent nodes in the graph are penalized, and thus 
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violation of the spatial structure implied by the point cloud are penalized. 
Lastly, in the Bayesian framework, kernel warping can informally be thought 
of as imposing a data-dependent informative prior to describe our belief of 
the data geometry. 

We replace the regular Gaussian kernel k in Equation (1) with the new 
warped kernel k defined in (2) to predict the density of ambulance demand 
fu at a future time period u. We set a priori the sliding window width M, 
the point cloud data type / size, the number of nearest neighbors n, and the 
weights used to construct the Laplacian matrix. We estimate the Gaussian 
kernel bandwidth H and the degree of deformation A. 

We show in Figure 2 (b) and (c) examples of warping kernels. Three 
kernels of bandwidth H = diag(2, 2) are placed on three observations circled 
in red in Figure 2 (a). They are warped towards the point cloud in (a) with 
degree of deformation A = 0.5 (b) and 2 (c). With a larger A, the kernels 
conform to the spatial boundary and characteristics to a greater extent. 



Fig 2: Examples of kernel warping: (a) the adjacency graph of a sample 
point cloud of size 1000; three observations are highlighted in red; (b) and 
(c), warped kernels centered at the these three observations with degrees of 
deformation A = 0.5 and 2, respectively. 


2.3. Spatio-temporal kernel warping. Melbourne’s ambulance demand shows 
substantial density variations with patterns in time (midnight vs rush hour) 
and in space (downtown vs neighborhoods). It may be beneficial to allow 
bandwidth H and degree of deformation A to vary with time and space. Ide¬ 
ally, we would like to find, in time and space, pockets of the point process 
with similar characteristics, and apply similar smoothing and deformation. 

We discretize time according to our modeling aims, i.e., into 1-hour time 
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periods. For each hour, we further discretize the spatial domain into a small 
number of regions, as motivated by the behavior of labeled data for that 
time period. We call each subregion of each hour a component, and perform 
estimations and predictions independently on each component. The spatial 
discretization splits a global point cloud into local ones, cuts all edges con¬ 
necting across regions, and decomposes the Laplacian matrix into blocks. 
Labeled data are also matched into components. We estimate a separate set 
of H and A for each component by cross-validation (details in Section 2.4). 

We discretize spatially by clustering. For any given future time period, we 
cluster on its labeled data (about 300 observations). We allow different num¬ 
bers of clusters and clustering configurations for each time period. In our 
application, this gives more accurate predictions than imposing a universal 
clustering configuration across time. We also obtain better results by cluster¬ 
ing on labeled data rather than clustering on the point cloud data (the point 
cloud is much more similar across time than the labeled data). In the case 
of Melbourne, spatial characteristics across time are different enough that 
the gain in personalized modeling exceeds the loss in stablization offered by 
a common arrangement. 

We choose to cluster using K-means based on Euclidean distance. K- 
means is fast, clusters all points, and gives even clusters. Even cluster sizes 
are desirable because a very small cluster does not provide enough labeled 
data to reliably estimate parameters via cross-validation. To avoid this, we 
set a threshold minimum number of points in any cluster. We set the thresh¬ 
old at 15 points, which in practice limits the number of clusters to be be¬ 
low 8. If we fail to clear this threshold, we lower the number of clusters. 
Density-based clustering algorithms such as DBSCAN (Ester et ah, 1996) 
and shared nearest neighbors (Ertoz, Steinbach and Kumar, 2003) do not 
classify all points, and do not allow easy specification of the number of clus¬ 
ters. Graph-based clustering algorithms such as affinity propagation (Frey 
and Dueck, 2007) and spectral clustering (Ng, Jordan and Weiss, 2001) do 
not cluster on Euclidean distance, and may be less intuitive for spatial point 
patterns. In our case, hierarchical clustering gives very uneven cluster sizes. 

For each time period, we binary search for the best number of clusters 
based on validation likelihood. Increasing the number of clusters leads to, on 
the one hand, an additional 1000 points to the cloud and the flexibility to 
customize parameters locally, but on the other hand, sparser labeled data for 
each cluster and reduced stability in parameter estimation. It is an empirical 
question for each time period whether we have enough labeled data to afford 
this increase in complexity. For Melbourne, we find the number of clusters 
to be largely proportional to the size of labeled data. 
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2.4. Computation. We estimate the kernel bandwidth H and the degree 
of deformation A for each spatio-temporal component. To reduce the dimen¬ 
sionality, we parametrize iT to be a scalar multiple of the plug-in bandwidth 
Hpi obtained if we fit an unwarped KDE for the same component. That is 
we define H = aHpi, and estimate a. Alternatively, we can dehne a radial 
bandwidth H = diag(/3,/3), reducing the Gaussian kernel to a radial basis 
function. We use H = aHpi because this parametrization gives slightly bet¬ 
ter performance in our preliminary analysis. To estimate a full H is more 
difficult because H needs to be positive semi-dehnite. 

We choose H and A for each component using 5-fold cross-validation to 
maximize average validation likelihood. We implement a surrogate, derivative- 
free optimization procedure called the stochastic radial basis function (RBF) 
method (Regis and Shoemaker, 2007, 2009). It is a fast algorithm for global 
optimization of computationally expensive objective functions. Each itera¬ 
tion builds an RBE model to approximate the expensive function, selects 
subsequent candidate points, and evaluates them in parallel. We choose this 
approach because our objective function (likelihood) evaluation is not in¬ 
stantaneous. It takes between 0.5 and 4 seconds, depending on the sizes of 
the labeled data and point cloud (Python code on a personal computer). We 
also do not have simple derivative computations. In our experience, 100 such 
evaluations are sufficient to provide a good optimum, competitive to those 
found by grid search, pattern search, or evolutionary algorithms. However, 
a wide range of optimization tools can be applied here. 

In our application, we hnd a typical optimal a to be between 0.05 and 0.3. 
We need a concentration of heat which is then spread or warped to the point 
cloud. A typical optimal A is between 0 and 2. Most time periods choose 
between 1 and 3 spatial components. We warm start the binary search for the 
number of clusters based on the size of labeled data. The best configuration 
is usually found within 3 searches. 

Given the prominent weekly seasonality, we believe that the corresponding 
parameter values are also similar from week to week. In fact, we believe 
that the nature of deformation and smoothing does not vary significantly 
over several months, and thus only estimate the parameters for a one-week 
cycle once every few months. With the most recent weekly set of parameter 
values, we predict forward in an online fashion with a sliding window of 
M = 8 weeks, making use of the most recent 8 weeks of data available. Each 
prediction is instantaneous. 

The most expensive part of the computation is evaluating kernels be¬ 
tween all pairs of point cloud data and taking the inverse of a large ma¬ 
trix. Several local point clouds of reasonable sizes (< 2000) is computation- 
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ally more efficient than one massive global point cloud. There are ways to 
optimize this computation, including using right division instead of inver¬ 
sion, saving pre-computed kernel evaluation matrices and vectors, exploiting 
sparse, banded-diagonal Laplacian matrix, using a tree-based algorithm for 
fast KDE computation (Gray and Moore, 2003), and using a look-up table 
for Gaussian densities (most of these optimizations are not used in our im¬ 
plementation). The computation is “embarrassingly parallelizable”, across 
validation likelihood evaluations and across spatio-temporal components. 

3. Predicting ambulance demand for Melbourne. We would like 
to predict ambulance demand in Melbourne for every 1-hour period in March 
2011. There are two stages to this computation. In the first stage, we esti¬ 
mate all parameters for a weekly cycle. The parameters include the spatial 
clustering configuration for each 1-hour period, as well as the parameters 
A (degree of warping) and a (in bandwidth H = aHpi) for each spatial 
component in each 1-hour period. This estimation only needs to be per¬ 
formed very infrequently, and in our case, once. For this estimation, we use 
Melbourne ambulance demand data from 8 weeks in January and February 
2011. In the second stage, we use the estimated weekly set of parameter 
values to predict future ambulance demand on a sliding window of 8 weeks 
for each 1-hour period in March 2011. 

Figure 3 shows the predictive density estimated by kernel warping for 
two time periods on March 2, 2011 (Wednesday). We have only about 150 
labeled data to predict for 2 - 3 am (a), and cross-validate to use only 1 
spatial component. We have almost 400 labeled data for 2-3 pm (b) and 
cross-validate to choose 5 spatial components. 

We consider two variations in estimation: (i) spatio-temporal kernel warp¬ 
ing (S-T param), in which we separately estimate parameters for each 1-hour 
period and spatial region (via clustering. Section 2.3), and (ii) temporal ker¬ 
nel warping (T param), in which we separately estimate parameters for 
each 1-hour period (no spatial clustering). We show in Figure 4 the predic¬ 
tive densities produced by these two approaches for the same time period. 
The densities look similar, with slightly more details when we use spatio- 
temporal kernel warping (we cross-validate to select 3 spatial clusters). 

We compare the proposed kernel warping models to the following 

(a) The MFDIG method, which is an industry practice implemented in 
Charlotte-Mecklenburg, NC (Section 1). We implement this method 
as far as we have data. The cell count in a 1-km^ region and a 1-hour 
period is predicted by the average of corresponding cell counts in the 
preceding 8 weeks. 
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(a) 


(b) 



km eastward from center of Melbourne 


Fig 3: Log predictive densities using spatio-temporal kernel warping for 
March 2, 2011 (Wednesday) at (a) 2 - 3 am (night), and (b) 2 - 3 pm 
(day). For time period (a), we have sparse data and cross-validate to choose 
1 spatial component. For time period (b), we have more data and choose 5 
spatial components. 


(b) Unwarped KDE, as in Equation (1). The bandwidth H is chosen via 
the plug-in method (PI) (Wand and Jones, 1994) and smoothed cross- 
validation (SCV) (Duong and Hazelton, 2005). This H is separately 
estimated for each time period, but does not vary in space. 

(c) Gaussian mixture model (GMM) (Zhou et ah, 2015), in which the 
means and covariances of Gaussian components are fixed through time, 
and the mixture weights vary in time. We also use labeled data from 
the last 8 weeks, and consider 15, 30 and 50 components. The compu¬ 
tational expenses are substantial. 

Figure 5 shows the log predictive density using the MEDIC method, un¬ 
warped KDE (PI), and GMM (30 components) for March 2, 2011 at 2 - 
3 pm. These densities are comparable with Eigure 3 (b), which shows the 
log predictive density for the same period predicted by the proposed kernel 
warping. Even with 400 labeled data. The MEDIC method gives exceedingly 
noisy predictions, while unwarped KDE and GMM produce over-smoothed 
densities that do not adapt well to the spatial features of Melbourne. 

We use several performance metrics to compare the statistical predic¬ 
tive accuracies of different methods. First, we use average log score (ALS) 
(Good, 1952). This metric is advocated for being a strictly proper scoring 
rule closely related to Bayes factor and Bayes information criterion (Gneit- 
ing and Raftery, 2007). It is the average log likelihood of test data. For each 
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Fig 4: Log predictive densities for March 2, 2011 (Wednesday) at 10 - 11 
am using (a) spatio-temporal kernel warping (3 spatial clusters), and (b) 
temporal kernel warping. The density in (a) shows slightly more details. 

test time period u in the set of all test time periods %est, 

riu 

ALS (m) = ^log/„(s„,j), 

i=l 

in which are the test data, and /«(•) is the predictive density for period 

u obtained by various methods. For the MEDIC method, we normalize cell 
counts to discrete density by dividing over the total count in each period. 

Secondly, we compare accuracy in cell counts for every 1-km^ region and 1- 
hour period. For the proposed kernel warping, unwarped KDE, and GMM, 
we discretize continuous predictions in space to each 1 km^, and convert 
to counts by multiplying the total count for the period as predicted by 
the MEDIC method. We compute the root-mean-square error, both within 
the smallest rectangle enclosing all data (plotting window in Figures 1, 3 - 
5) (RMSE) and within a pixelated data-driven boundary of Melbourne B 
(RMSEs). For each test time period u G Ttest, 


where C is the number of 1 km^ cells in the rectangular observation window, 
yu,c and iju^c are the actual and predicted count for period u and cell c 
respectively. For RMSE^, we use cells c within the pixelated boundary B 
and C as the number of 1 km^ cells within this boundary. 


RMSE (u) = 


\ 


c 


Q 'y^Xyu,c yu,c 


C=1 
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(a) 


(c) 
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Fig 5: Log predictive densities using comparison methods for 2-3 pm on 
March 2, 2011 (Wednesday): (a) the MEDIC method (an industry practice); 
(b) unwarped KDE with bandwidth selected by the plug-in method (PI); (c) 
time-varying Gaussian mixture model with 30 components. These densities 
are to be compared to Eigure 3 (b), which is the prediction using kernel 
warping for the same period. 

Additionally, since these cell counts (mostly Os and Is) are more appropri¬ 
ately modeled by a discrete distribution such as the Poisson distribution, we 
also compute the root-mean-square Anscombe residuals (Anscombe, 1953; 
McCullagh and Nelder, 1989), which specihcally adjusts to measure predic¬ 
tive accuracy for Poisson data. Similarly, we consider within all of the rect¬ 
angular window (ANSC) and within the boundary of Melbourne (ANSC^). 
Using the same notations as above. 



ANSC 


and ANSCs is similarly defined. We show in Table 1 the mean predic¬ 
tive accuracies of various methods, averaged across all test time periods 
Ttest (all 1-hour periods in March 2011). A less negative ALS, and smaller 
RMSE, RMSE^, ANSC, and ANSCs indicate better predictive accuracy. 
Both versions of kernel warping have a significant advantage over the com¬ 
parison methods in all performance measures, especially in RMSE^ and 
ANSCs. Between the two versions of kernel warping, allowing parameters 
to be location-specific (in addition to being time-specific) provides additional 
benefits, even though a large number of time periods choose to use only 1 
spatial component. We further show in Figure 6 the box-plots illustrating 
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the variations of some of these metrics across time periods. Kernel warping 
has not only the best mean performance, but also the smallest variations 
across time periods. 


Prediction method 

ALG 

RMSE 

RMSEb 

ANSC 

ANSCs 

Kernel warping 

S-T param 

-7.53 

0.0500 

0.0498 

0.176 

0.171 


T param 

-7.56 

0.0518 

0.0514 

0.178 

0.172 

(a) MEDIC 


-10.11 

0.0589 

0.0996 

0.479 

0.810 

(b) Unwarped KDE 

PI 

-8.14 

0.0562 

0.0950 

0.199 

0.334 


scv 

-8.15 

0.0562 

0.0950 

0.194 

0.325 

(c) GMM 

15 comp 

-7.96 

0.0562 

0.0949 

0.181 

0.304 


30 comp 

-7.87 

0.0561 

0.0948 

0.191 

0.323 


50 comp 

-7.93 

0.0561 

0.0949 

0.188 

0.316 


Table 1 

Mean predictive accuracies across all 1-hour periods in March 2011 of the proposed kernel 
warping and competing methods. Kernel warping outperforms the competing methods. 



Fig 6: Box-plots of predictive accuracies of kernel warping (S-T parameters), 
GMM (30 comp), KDE (PI bandwidth), and the MEDIC method (an indus¬ 
try practice) over 672 test periods, as measured by average log score (left, 
less negative is better), RMSE^ (middle, smaller is better), and ANSCs 
(right, smaller is better). 


4. Conclusions. Fine-resolution spatio-temporal ambulance demand 
predictions are critical to optimal ambulance planning. Typical challenges 
include data sparsity at the prediction resolution and incorporation of com¬ 
plex urban spatial domains. These challenges are especially prominent in 
Melbourne. They create a tension; overcoming sparsity requires consider¬ 
able smoothing, while representing complex spatial features requires fine 
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resolution modeling. Most current industry practices and earlier studies are 
ill-equipped to address these challenges simultaneously. We propose a kernel 
warping method that smooths intelligently towards geographical characteris¬ 
tics. We demonstrate that our proposed method predicts ambulance demand 
in Melbourne more accurately than the state of the art in the practice and 
research of ambulance demand prediction. 

To predict ambulance demand for any hour, we use a spatio-temporal 
kernel density estimator on the sparse set of most similar labeled data, 
but warp these kernels to a larger point cloud drawn from all historical 
observations regardless of labels. We construct an adjacency graph on this 
point cloud to approximate Melbourne’s spatial boundaries, neighborhoods, 
and road networks in a data-driven manner. Kernels on labeled data are 
warped to encourage flow along and penalize flow orthogonal to this graph. 

Kernel warping circumvents the need to define boundaries and bound¬ 
ary conditions, which are often difficult in the practice of modeling point 
patterns on complex spatial domains. It also captures and exploits finer- 
grain internal spatial structures other than boundary features, which can be 
prominent in various heterogeneous environments such as cities, buildings, 
mountains, and forests. Kernel warping is not limited to density estimation. 
It can be adapted to model a wide range of functions and surfaces. It can 
be used to perform a broad set of tasks including prediction, classification, 
clustering, and visualization. Inferences on uncertainty, if desired, can be ob¬ 
tained by assessing cross-validation variance and warping kernels to different 
samples of point clouds. There is much flexibility in designing the point cloud 
and its Laplacian. We offer some discussions on these in the context of spa¬ 
tial and spatio-temporal point patterns. We also offer efficient estimation of 
kernel bandwidth and degree of warping local to time periods and locations 
via cross-validation. The proposed method is straightforward to implement 
and easy to experiment with. The tools we have developed can be easily gen¬ 
eralized to model a wide range of spatial or spatio-temporal point process 
on complex spatial domains. 
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